37 research outputs found

    R.ROSETTA: an interpretable machine learning framework.

    Funders: Uppsala Universitet (doi: http://dx.doi.org/10.13039/501100007051); Polska Akademia Nauk (doi: http://dx.doi.org/10.13039/501100004382); Uppsala University.
    BACKGROUND: Machine learning involves strategies and algorithms that may assist bioinformatics analyses in terms of data mining and knowledge discovery. In several applications, notably in the life sciences, it is often more important to understand how a prediction was obtained than to know what prediction was made. To this end, so-called interpretable machine learning has recently been advocated. In this study, we implemented an interpretable machine learning package based on rough set theory. An important aim of our work was the provision of statistical properties of the models and their components.
    RESULTS: We present the R.ROSETTA package, which is an R wrapper of the ROSETTA framework. The original ROSETTA functions have been improved and adapted to the R programming environment. The package allows for building and analyzing non-linear interpretable machine learning models. R.ROSETTA gathers combinatorial statistics via rule-based modelling for accessible and transparent results, well-suited for adoption within the greater scientific community. The package also provides statistics and visualization tools that facilitate minimization of analysis bias and noise. The R.ROSETTA package is freely available at https://github.com/komorowskilab/R.ROSETTA . To illustrate the usage of the package, we applied it to a transcriptome dataset from an autism case-control study. Our tool provided hypotheses for potential co-predictive mechanisms among features that discerned phenotype classes. These co-predictors represented neurodevelopmental and autism-related genes.
    CONCLUSIONS: R.ROSETTA provides new insights for interpretable machine learning analyses and knowledge-based systems. We demonstrated that our package facilitated detection of dependencies for autism-related genes. Although the sample application of R.ROSETTA illustrates transcriptome data analysis, the package can be used to analyze any data organized in decision tables.

    Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis.

    Long non-coding RNAs (lncRNAs) are a growing focus of cancer genomics studies, creating the need for a resource of lncRNAs with validated cancer roles. Furthermore, it remains debated whether mutated lncRNAs can drive tumorigenesis, and whether such functions could be conserved during evolution. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we introduce the Cancer LncRNA Census (CLC), a compilation of 122 GENCODE lncRNAs with causal roles in cancer phenotypes. In contrast to existing databases, CLC requires strong functional or genetic evidence. CLC genes are enriched amongst driver genes predicted from somatic mutations, and display characteristic genomic features. Strikingly, CLC genes are enriched for driver mutations from unbiased, genome-wide transposon-mutagenesis screens in mice. We identified 10 tumour-causing mutations in orthologues of 8 lncRNAs, including LINC-PINT and NEAT1, but not MALAT1. Thus CLC represents a dataset of high-confidence cancer lncRNAs. Mutagenesis maps are a novel means for identifying deeply-conserved roles of lncRNAs in tumorigenesis

    Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.

    The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

    Integrating multi-omics for type 2 diabetes : Data science and big data towards personalized medicine

    Type 2 diabetes (T2D) is a complex metabolic disease characterized by multi-tissue insulin resistance and failure of the pancreatic β-cells to secrete sufficient amounts of insulin. Cells recruit transcription factors (TF) to specific genomic loci to regulate gene expression that consequently affects the protein and metabolite abundances. Here we investigated the interplay of transcriptional and translational regulation, and its impact on metabolome and phenome for several insulin-resistant tissues from T2D donors. We implemented computational tools and multi-omics integrative approaches that can facilitate the selection of candidate combinatorial markers for T2D. We developed a data-driven approach to identify putative regulatory regions and TF-interaction complexes. The cell-specific sets of regulatory regions were enriched for disease-related single nucleotide polymorphisms (SNPs), highlighting the importance of such loci towards the genomic stability and the regulation of gene expression. We employed a similar principle in a second study where we integrated single nucleus ribonucleic acid sequencing (snRNA-seq) with bulk targeted chromosome-conformation-capture (HiCap) and mass spectrometry (MS) proteomics from liver. We identified a putatively polymorphic site that may contribute to variation in the pharmacogenetics of fluoropyrimidine toxicity for the DPYD gene. Additionally, we found a complex regulatory network between a group of 16 enhancers and the SLC2A2 gene that has been linked to increased risk for hepatocellular carcinoma (HCC). Moreover, three enhancers harbored motif-breaking mutations located in regulatory regions of a cohort of 314 HCC cases, and were candidate contributors to malignancy. In a cohort of 43 multi-organ donors we explored the alternating pattern of metabolites among visceral adipose tissue (VAT), pancreatic islets, skeletal muscle, liver and blood serum samples. A large fraction of lysophosphatidylcholines (LPC) decreased in muscle and serum of T2D donors, while a large number of carnitines increased in liver and blood of T2D donors, confirming that changes in metabolites occur in primary tissues, while their alterations in serum constitute a secondary event. Next, we associated metabolite abundances from 42 subjects to glucose uptake, fat content and volume of various organs measured by positron emission tomography/magnetic resonance imaging (PET/MRI). The fat content of the liver was positively associated with the amino acid tyrosine, and negatively associated with LPC(P-16:0). The insulin sensitivity of VAT and subcutaneous adipose tissue was positively associated with several LPCs, while the opposite applied to branched-chain amino acids. Finally, we presented the network visualization of a rule-based machine learning model that predicted non-diabetes and T2D in an “unseen” dataset with 78% accuracy.

    Supplementary tables: MetaFetcheR: An R package for complete mapping of small compound data

    Small-compound databases contain a large amount of information for metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish means of data access at the early stages of a project might lead to mislabelled compounds, reduced statistical power and large delays in delivery of results. We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.

    MetaFetcheR : An R Package for Complete Mapping of Small-Compound Data

    Small-compound databases contain a large amount of information for metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish means of data access at the early stages of a project might lead to mislabelled compounds, reduced statistical power, and large delays in delivery of results. We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies, and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.
    Title in thesis list of papers: MetaFetcheR: An R package for complete mapping of small compound data

    Multifaceted regulation of hepatic lipid metabolism by YY1

    Recent studies suggested that dysregulated YY1 plays a pivotal role in many liver diseases. To obtain a detailed view of genes and pathways regulated by YY1 in the liver, we carried out RNA sequencing in HepG2 cells after YY1 knockdown. A rigid set of 2,081 differentially expressed genes was identified by comparing the YY1-knockdown samples (n = 8) with the control samples (n = 14). YY1 knockdown significantly decreased the expression of several key transcription factors and their coactivators in lipid metabolism. This is illustrated by YY1 regulating PPARA expression through binding to its promoter and enhancer regions. Our study further suggests that down-regulation of the key transcription factors together with YY1 knockdown significantly decreased the cooperation between YY1 and these transcription factors at various regulatory regions, which are important in regulating the expression of genes in hepatic lipid metabolism. This was supported by the finding that the expression of SCD and ELOVL6, encoding key enzymes in lipogenesis, was regulated by the cooperation between YY1 and the PPARA/RXRA complex over their promoters.